Abstract

We discuss multi-task online learning when a de- 
cision maker has to deal simultaneously with M 
tasks. The tasks are related, which is modeled 
by imposing that the M -tuple of actions taken by 
the decision maker needs to satisfy certain con- 
straints. We give natural examples of such restric- 
tions and then discuss a general class of tractable 
constraints, for which we introduce computation- 
ally efficient ways of selecting actions, essentially 
by reducing to an on-line shortest path problem. 
We briefly discuss "tracking" and "bandit" versions 
of the problem and extend the model in various 
ways, including non-additive global losses and un- 
countably infinite sets of tasks. 



1 Introduction 

Multi-task learning has recently received considerable attention,
see [DLS07], [ABR07], [Men07], [CCBG08]. In multi-task
learning problems, one simultaneously learns several tasks 
that are related in some sense. The relationship of the tasks 
has been modeled in different ways in the literature. In our 
setting, a decision maker chooses an action simultaneously 
for each of M given tasks, in a repeated manner. (To each 
of these tasks corresponds a game, and we will use inter- 
changeably the concepts of game and task.) The relatedness 
is accounted for by putting some hard constraints on these 
simultaneous actions. 

As a motivating example, consider a distance-selling com- 
pany that designs several commercial offers for its numerous 
customers, and the customers are ordered (say) by age. The 
company has to choose whom to send which offer. A loss
of earnings is suffered whenever a customer does not receive
the commercial offer that would have been best for him. Basic
marketing considerations suggest that offers given to customers
of similar age should not be very different, so the
company selects a batch of offers that satisfy such a constraint.
An additional budget constraint may further limit the
set of batches from which the company may select. After the
offers are sent out, the customers' responses are observed 
(at least partially) and new offers are selected and sent. We 
model such situations by playing many repeated games si- 
multaneously with the restriction that the vector of actions 
that can be selected at a time needs to belong to a previously 
given set. This set is determined beforehand by the budget
and marketing constraints discussed above. The goal of the
decision maker is to minimize the total accumulated regret
(across the many games and through time), that is, to perform,
in the long run, almost as well as the best constant vector of
actions satisfying the constraint.

The problem of playing repeatedly several games simultaneously
has been considered by [Men07], who studies convergence
to Nash equilibria but does not address the issue
of computational feasibility when a large number of games
is played. On-line multi-task learning problems were also
studied by [ABR07] and [DLS07]. Like the latter reference,
we consider minimizing regret simultaneously in parallel,
while however enforcing some hard constraints. Like [ABR07], we
measure the total loss as the sum of the losses suffered in
each game but assume that all tasks have to be performed at
each round. (This assumption is, however, relaxed in Section 8,
where we consider global losses more general than
the sums of losses.) The main additional difficulty we face
is the requirement that the decision maker chooses from a 
restricted subset of vectors of actions. In previous models 
restrictions were only considered on the comparison class, 
but not on the way the decision maker plays. 

We formulate the problem in the framework of on-line regret
minimization, see [CBL06] for a survey. The main challenge
is to construct a strategy for playing the many games
simultaneously with small regret such that the strategy has 
a manageable computational complexity. We show that in 
various natural examples the computational problem may be 
reduced to an online shortest path problem in an associated 
graph for which well-known efficient algorithms exist. (We 
however propose a specific scheme for implementation that 
is slightly more effective.) 

The results can be extended easily to the "tracking" case 



in which the goal of the decision maker is to perform as 
well as the best strategy that can change the vector of ac- 
tions (taken from the restricted set) at a limited number of 
times. We also consider the "bandit" version of the problem 
when the decision maker, instead of observing the losses of 
all actions in all games, only learns the sum of the losses of 
the chosen actions. 

Finally, we also consider cases when there are infinitely 
many tasks, indexed by real numbers. In such cases the deci- 
sion maker chooses a function from a certain restricted class 
of functions. We show examples that are natural extensions 
of the cases we consider for finitely many tasks and discuss 
the computational issues that are closely related to the theory 
of exact simulation of continuous-time Markov chains. 

We concentrate on exponentially weighted average forecasters
because, when compared to their most likely competitors,
that is, follow-the-leader-type algorithms, they have better
performance guarantees, especially in the case of bandit
feedback. Besides, the two families of forecasters, as pointed
out by [ABR07], usually have implementation complexities
of the same order.

2 Setup and notation 

In the simplest model studied in this paper, a decision maker 
deals simultaneously with M tasks, indexed by j = 1, . . . , M. 
For simplicity, we assume that all games share the same finite
action space $\mathcal{X} = \{x_1, \dots, x_N\} \subset \mathbb{R}$. (Here, we do
not identify actions with integers but with real numbers, for
reasons that will be clear in Section 3.)

To each task $j = 1, \dots, M$ there is an associated outcome
space $\mathcal{Y}_j$ and a loss function $\ell^{(j)} : \mathcal{X} \times \mathcal{Y}_j \to [0, 1]$.
We denote by $\mathbf{x} = (x_{k_1}, \dots, x_{k_M})$ the elements of $\mathcal{X}^M$
and call them vectors of simultaneous actions. The tasks
are played repeatedly and at each round $t = 1, 2, \dots$, the
decision maker chooses a vector $\mathbf{X}_t = (X_{1,t}, \dots, X_{M,t}) \in \mathcal{X}^M$
of simultaneous actions. (That is, he chooses indexes
$K_{1,t}, \dots, K_{M,t} \in \{1, \dots, N\}$ and $X_{j,t} = x_{K_{j,t}}$ for all
$j = 1, \dots, M$.) We assume that the choice of $\mathbf{X}_t$ can be
made at random, according to a probability distribution over
$\mathcal{X}^M$ which will usually be denoted by $p_t$. The behavior of
the opponent player among all tasks is described by the vector
of outcomes $\mathbf{y}_t = (y_{1,t}, \dots, y_{M,t})$.

We are interested in the loss suffered by the decision 
maker and we do not assume any specific probabilistic or 
strategic behavior of the environment. In fact, the outcome 
vectors yt, for t = 1,2,..., can be completely arbitrary and 
we measure the performance of the decision maker by com- 
paring it to the best of a class of reference strategies. The 
total loss suffered by the decision maker at time $t$ is just the
sum of the losses over tasks:

$$\ell(\mathbf{X}_t, \mathbf{y}_t) = \sum_{j=1}^{M} \ell^{(j)}(X_{j,t}, y_{j,t}).$$

The important point is that the decision maker has some re- 
strictions to be obeyed in each round, which we also call 
hard constraints. They are modeled by a subset $\mathcal{A}$ of the set
of possible simultaneous actions $\mathcal{X}^M$; the forecaster is only
allowed to play vectors $\mathbf{X}_t$ in $\mathcal{A}$. This subset $\mathcal{A}$ captures the
relatedness among the tasks. 



The decision maker aims at minimizing his regret, de- 
fined by the difference of his cumulative loss with respect to 
the cumulative loss of the best constant vector of actions, de- 
termined in hindsight, among the set of allowed vectors A. 
Formally, the regret is defined by 

$$R_n = \sum_{t=1}^{n} \ell(\mathbf{X}_t, \mathbf{y}_t) - \min_{\mathbf{x} \in \mathcal{A}} \sum_{t=1}^{n} \ell(\mathbf{x}, \mathbf{y}_t).$$

In the basic, full-information version of the problem, the decision
maker, after choosing $\mathbf{X}_t$, observes the vector of outcomes
$\mathbf{y}_t$. In the bandit setting, only the total loss $\ell(\mathbf{X}_t, \mathbf{y}_t)$
becomes available to the decision maker.
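To make the definition of the regret concrete, the following minimal Python sketch evaluates it on a toy instance. The function names, the absolute loss and the two-element set of allowed vectors are illustrative assumptions and do not come from the paper.

```python
def total_loss(x, y, losses):
    """Loss of one round: sum over tasks j of losses[j](x[j], y[j])."""
    return sum(loss(xj, yj) for loss, xj, yj in zip(losses, x, y))


def regret(played, outcomes, allowed, losses):
    """Cumulative loss of the played vectors minus the cumulative loss
    of the best constant vector taken from `allowed`."""
    incurred = sum(total_loss(x, y, losses) for x, y in zip(played, outcomes))
    best_fixed = min(
        sum(total_loss(a, y, losses) for y in outcomes) for a in allowed
    )
    return incurred - best_fixed


# Toy instance (hypothetical): M = 2 tasks, actions {0, 1}, absolute loss.
losses = [lambda x, y: abs(x - y)] * 2
allowed = [(0, 0), (1, 1)]      # assumed constraint: same action in both tasks
played = [(0, 0), (1, 1)]
outcomes = [(1, 1), (0, 0)]
print(regret(played, outcomes, allowed, losses))
```

In this toy run the forecaster is wrong in every task at every round, while each constant vector is right in exactly one round.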

Observe that in the case of M = 1 task, the problem 
reduces to the well-studied problem of "on-line prediction 
with expert advice" or "sequential regret minimization," see 
[CBL06] for the history and basic results. This is also the
case when $M \ge 2$ but $\mathcal{A} = \mathcal{X}^M$, since the decision maker
could then treat each task independently from others and 
maintain M parallel forecasting schemes, at least in the full- 
information setting. Under the bandit assumption the prob- 
lem becomes the "multi-task bandit problem" discussed in 
[CBL09], which is also easy to solve by available techniques.
However, when $\mathcal{A}$ is a proper subset of $\mathcal{X}^M$, interesting
computational problems arise. The efficient implementation we
propose requires the set $\mathcal{A}$ of restrictions to satisfy a
structural condition. This condition, satisfied in several natural
examples discussed below, permits us to reduce the problem 
to the well-studied problem of predicting as well as the best 
path between two fixed vertices of a graph. 

In order to make the model meaningful, just like in the 
most basic versions of the problem, we allow the decision 
maker to randomize his decision in each period. More formally,
at each round of the repeated game, the decision maker
determines a distribution on $\mathcal{X}^M$ (restricted to the set $\mathcal{A}$) and
draws the action vector $\mathbf{X}_t$ according to this distribution.
Before determining the outcomes, the opponent may have access
to the probability distribution the decision maker uses
but not to the realizations of the random variables.

Structure of the paper 

We start by stating some natural examples on which the pro- 
posed techniques will be illustrated. We then study the full- 
information version of the problem (when the decision maker 
observes all past outcomes before determining his proba- 
bility distribution) by proposing first a hypothetical scheme 
with good performance and then stating an efficient imple- 
mentation of it. 

We also consider various extensions. One of them is the 
bandit setting, when only the sum of losses of the chosen
simultaneous actions is observed. Another extension is the
"tracking" problem when, instead of competing with the best 
constant vector of actions, the decision maker intends to per- 
form as well as the best strategy that is allowed to switch a 
certain limited number of times (but always satisfying the re- 
strictions). We also consider alternative global loss functions 
that do not necessarily sum the losses over the tasks. Finally, 
we describe a setting in which there are infinitely many tasks 
indexed by an interval. This is a natural extension of the 
main examples we work with and the algorithmic problem 



has some interesting connections with exact simulation of 
continuous-time discrete Markov chains. 

3 Motivating examples 

We start by describing four examples that we will be able 
to handle with the proposed machinery. The examples are 
defined by their corresponding sets $\mathcal{A} \subset \mathcal{X}^M$ of permitted
simultaneous actions. 

Example 1 (Internal coherence) Assume that tasks are lin- 
early ordered and any two consecutive tasks, though differ- 
ent, share some similarity. Therefore, it is a natural require- 
ment that the actions taken in two consecutive games be not 
too far away from each other. One may also interpret this as a 
matter of internal coherence of the decision maker. To model 
this, we assume that the actions are ranked in the action set X 
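according to some logic and impose some maximal dissimilarity
$\gamma > 0$ between the actions of two consecutive tasks,
that is,

$$\mathcal{A} = \bigl\{ (x_{k_1}, \dots, x_{k_M}) : |x_{k_j} - x_{k_{j+1}}| \le \gamma \ \text{for all } j \le M-1 \bigr\}.$$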



Example 2 (Escalation constraint) Once again we assume 
that the tasks are linearly ordered and the actions are ranked. 
Imagine that tasks correspond to consumers and that the higher 
the index of the task, the more favorable the conditions for 
the consumer (and the higher the loss of earnings of the 
seller, who is the decision maker). The constraint the decision
maker has to satisfy is that higher-ranked customers need to
receive better conditions, at least within the same round of
play. That is, the simultaneous actions must form a
non-decreasing sequence in the following sense,

$$\mathcal{A} = \bigl\{ (x_{k_1}, \dots, x_{k_M}) : x_{k_1} \le x_{k_2} \le \dots \le x_{k_M} \bigr\}.$$



Example 3 (Constancy constraint) Assume that tasks are 
ordered and that the decision maker should not vary its ac- 
tion too often. This is measured by the fact that the decision 
maker must stick to an action for several consecutive tasks 
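and that he can shift to a new action only at a limited number
$m$ of tasks, which we model by

$$\mathcal{A} = \Bigl\{ (x_{k_1}, \dots, x_{k_M}) : \sum_{j=1}^{M-1} \mathbb{1}_{\{x_{k_j} \ne x_{k_{j+1}}\}} \le m \Bigr\}.$$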



Example 4 (Budget constraint) Here we assume that the 
number $x_{k_j}$ associated to action $k_j$ in task $j$ represents the
cost of choosing this action. The freedom of the decision
maker is limited by a budget constraint. For example, one
may face a situation when the decision maker has a constant
budget $B$ to be used at each round, that is,

$$\mathcal{A} = \Bigl\{ (x_{k_1}, \dots, x_{k_M}) : \sum_{j=1}^{M} x_{k_j} \le B \Bigr\}.$$



To make things more concrete, we assume, in this example
only, that $x_k = k$. One should then take for $B$ an integer
between $M$ and $NM$. For smaller values $\mathcal{A}$ is empty
and for larger values $\mathcal{A} = \mathcal{X}^M$.
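The four constraint sets can be written down directly as predicates on a vector of simultaneous actions. The sketch below enumerates the corresponding sets of allowed vectors by brute force on a tiny instance; the function names and the numerical values are illustrative choices, and the enumeration is only feasible when $N^M$ is small.

```python
from itertools import product


def internal_coherence(x, gamma):
    """Example 1: consecutive actions differ by at most gamma."""
    return all(abs(a - b) <= gamma for a, b in zip(x, x[1:]))


def escalation(x):
    """Example 2: actions form a non-decreasing sequence."""
    return all(a <= b for a, b in zip(x, x[1:]))


def constancy(x, m):
    """Example 3: at most m shifts between consecutive tasks."""
    return sum(a != b for a, b in zip(x, x[1:])) <= m


def budget(x, B):
    """Example 4: total cost of the chosen actions is at most B."""
    return sum(x) <= B


def allowed_set(actions, M, constraint):
    """Brute-force enumeration of the allowed M-tuples (small instances only)."""
    return [x for x in product(actions, repeat=M) if constraint(x)]


actions = [1, 2, 3]          # x_k = k, as in the budget example
print(len(allowed_set(actions, 3, escalation)))
print(len(allowed_set(actions, 3, lambda x: budget(x, 5))))
```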



4 Exponentially weighted averages 

By considering each element of $\mathcal{A}$ as a (meta-)expert, we
can reduce the problem to the usual single-task setting and 
exhibit a forecaster with a good performance bound that, in 
its straightforward implementation, has a computational cost 
proportional to the cardinality of A. 

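More precisely, for each round $n \ge 1$, we denote by

$$L_n(\mathbf{x}) = \sum_{t=1}^{n} \ell(\mathbf{x}, \mathbf{y}_t)$$

the cumulative loss of the simultaneous actions $\mathbf{x} \in \mathcal{X}^M$, and
define an instance of the exponentially weighted average forecaster
on these cumulative losses. That is, at round $t = 1$, the
decision maker draws an element $\mathbf{X}_1$ uniformly at random in
$\mathcal{A}$ and for each round $t \ge 2$, draws $\mathbf{X}_t$ at random according
to the distribution $p_t$ on $\mathcal{A}$ which puts the following mass on
each $\mathbf{x} \in \mathcal{A}$,

$$p_t(\mathbf{x}) = \frac{\exp\bigl(-\eta L_{t-1}(\mathbf{x})\bigr)}{\sum_{\mathbf{a} \in \mathcal{A}} \exp\bigl(-\eta L_{t-1}(\mathbf{a})\bigr)}, \tag{1}$$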



where $\eta > 0$ is a parameter to be tuned. The bound below follows
from a direct application of well-known results, see, for
instance, [CBL06, Corollary 4.2].

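Proposition 1 For all $n \ge 1$, the above instance of the
exponentially weighted average forecaster, when run with
$\eta = (1/M) \sqrt{8 (\ln |\mathcal{A}|) / n}$, ensures that for all $\delta > 0$, its regret is
bounded, with probability at least $1 - \delta$, as

$$R_n \le M \left( \sqrt{\frac{n}{2} \ln |\mathcal{A}|} + \sqrt{\frac{n}{2} \ln \frac{1}{\delta}} \right),$$

where $|\mathcal{A}|$ denotes the cardinality of $\mathcal{A}$.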
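The naive forecaster can be sketched in a few lines of Python once the set of allowed vectors is stored explicitly. This is only an illustration: the function names are hypothetical, the tuning uses the logarithm of the number of meta-experts as in standard expert-advice results (an assumption of this sketch), and the cost of each call is proportional to the number of allowed vectors.

```python
import math
import random


def ewa_distribution(allowed, cumulative_loss, eta):
    """Exponentially weighted average distribution over an explicit list of
    allowed vectors; cumulative_loss maps each vector to its past loss."""
    shift = min(cumulative_loss[a] for a in allowed)   # numerical stability
    weights = [math.exp(-eta * (cumulative_loss[a] - shift)) for a in allowed]
    total = sum(weights)
    return [w / total for w in weights]


def tuned_eta(M, n, size_A):
    """Learning rate (1/M) * sqrt(8 ln(size_A) / n); assumed tuning."""
    return (1.0 / M) * math.sqrt(8.0 * math.log(size_A) / n)


def draw(allowed, probs, rng):
    """Draw one allowed vector according to probs."""
    return rng.choices(allowed, weights=probs, k=1)[0]


allowed = [(1, 1), (1, 2)]
cumulative = {(1, 1): 0.0, (1, 2): 1.0}
probs = ewa_distribution(allowed, cumulative, eta=math.log(2))
print(probs, draw(allowed, probs, random.Random(0)))
```

Subtracting the smallest cumulative loss before exponentiating leaves the distribution unchanged and avoids underflow when losses are large.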

The computational complexity of this forecaster, in its
naive implementation, is proportional to $|\mathcal{A}|$, which is
prohibitive in all examples of Section 3 since the cardinality of
$\mathcal{A}$ is exponentially large. For example, in Example 1, if we
denote by

$$\rho = \min_{x \in \mathcal{X}} \bigl| \{ x' \in \mathcal{X} : |x - x'| \le \gamma \} \bigr|$$

a common lower bound on the number of $\gamma$-close actions to
any action in $\mathcal{X}$, then

$$|\mathcal{A}| \ge N \rho^{M-1}.$$

In Example 2, by first choosing the $m$ actions to be used (in
increasing order) and the $m - 1$ corresponding shift points,
one gets

$$|\mathcal{A}| = \sum_{m=1}^{N} \binom{N}{m} \binom{M-1}{m-1} = \binom{M+N-1}{N-1} \ge \frac{(M+1)^{N-1}}{(N-1)!}.$$



In the case of at most $m$ shifts in the simultaneous actions,
discussed in Example 3, we have

$$|\mathcal{A}| \ge \binom{M-1}{m} N (N-1)^m$$

(where the lower bound is obtained by considering only the
simultaneous actions with exactly $m$ shifts). That is, $|\mathcal{A}|$ is of
the order of $(MN)^m / m!$. Finally, with the budget constraint
of Example 4, the typical size of $\mathcal{A}$ is exponential in $M$, as

$$|\mathcal{A}| \ge \rho^M,$$

where $\rho = \lfloor B/M \rfloor$ is the lower integer part of $B/M$.
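These orders of magnitude can be sanity-checked by brute force on a small instance. The sketch below counts the allowed vectors for the escalation, constancy and budget constraints and compares them with elementary counting bounds; the closed form used for non-decreasing sequences is the standard stars-and-bars count, and all names and numerical values are illustrative assumptions.

```python
from itertools import product
from math import comb, factorial


def count_allowed(actions, M, constraint):
    """Number of M-tuples satisfying the constraint (brute force)."""
    return sum(1 for x in product(actions, repeat=M) if constraint(x))


def shifts(x):
    """Number of positions where consecutive actions differ."""
    return sum(a != b for a, b in zip(x, x[1:]))


N, M, m, B = 3, 4, 1, 8
actions = list(range(1, N + 1))          # x_k = k

n_escalation = count_allowed(actions, M, lambda x: list(x) == sorted(x))
n_constancy = count_allowed(actions, M, lambda x: shifts(x) <= m)
n_budget = count_allowed(actions, M, lambda x: sum(x) <= B)

# Non-decreasing sequences of length M over N ranked actions.
print(n_escalation, comb(M + N - 1, N - 1), (M + 1) ** (N - 1) / factorial(N - 1))
# Vectors with exactly m shifts give a lower bound for "at most m shifts".
print(n_constancy, comb(M - 1, m) * N * (N - 1) ** m)
# Every vector with all costs at most floor(B / M) fits in the budget.
print(n_budget, (B // M) ** M)
```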

5 Efficient implementation with online 
shortest path 

In this section we show how the computational problem of
drawing a random vector of actions $\mathbf{X}_t \in \mathcal{A}$ according to the
exponentially weighted average distribution can be reduced
to the well-studied online shortest path problem. Recall that
in the online shortest path problem (see, e.g., [TW04], [GLL04],
[GLL05]) the decision maker selects, at each round of the
game, a path between two given vertices (the source and the
sink) in a given graph. A loss is assigned to each edge of the
graph in every round of the game and the loss of a path is
the sum of the losses of the edges. A path can be selected
according to the exponentially weighted average distribution in
a computationally efficient way by a dynamic programming-type
algorithm, see [TW04] or [CBL06, Section 5.4]. The
algorithm has complexity $O(|\mathcal{E}|)$ where $\mathcal{E}$ is the set of edges
of the graph.

We first explain how the problem of drawing a joint ac- 
tion in the multi-task problem can be reduced to an online 
shortest path problem in all the examples presented above 
and then indicate how to efficiently sample from the distribution
$p_t$ defined in (1).

5.1 A Markovian description of the constraints 

In order to define the corresponding graph in which the online
shortest path problem is equivalent to our hard-constrained
multi-task problem, we introduce a set $\mathcal{S}$ of hidden
states. The value of the hidden state controls that the hard
constraints are satisfied along the sequence of simultaneous
actions. To this end, denote by $S$ the state function, which,
given a vector of actions (of length $\le M$), outputs the
corresponding state in $\mathcal{S}$.

We also consider an additional state $\star$ meaning that the
hard constraint is not satisfied. We denote $\mathcal{S}^\star = \mathcal{S} \cup \{\star\}$. By
definition,

$$\mathcal{A} = \bigl\{ \mathbf{x} \in \mathcal{X}^M : S(\mathbf{x}) \ne \star \bigr\}.$$

To make things more concrete we now describe $\mathcal{S}$ and $S$ on
all four examples introduced in Section 3.

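The first two examples are the simplest as all the information
is contained in the current action; their hidden state
space $\mathcal{S}$ is reduced to a single state $\text{ok}$. For Example 1, for all
sequences $(x_{k_1}, \dots, x_{k_j})$ of length $1 \le j \le M$, one defines

$$S\bigl((x_{k_1}, \dots, x_{k_j})\bigr) = \begin{cases} \text{ok} & \text{if for all } i \le j-1,\ |x_{k_i} - x_{k_{i+1}}| \le \gamma, \\ \star & \text{otherwise,} \end{cases}$$

whereas for Example 2,

$$S\bigl((x_{k_1}, \dots, x_{k_j})\bigr) = \begin{cases} \text{ok} & \text{if for all } i \le j-1,\ x_{k_i} \le x_{k_{i+1}}, \\ \star & \text{otherwise.} \end{cases}$$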

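In Example 3 the underlying hidden state counts the number
of shifts seen so far in the sequence of actions, so $\mathcal{S} =
\{0, \dots, m\}$ and for all sequences $(x_{k_1}, \dots, x_{k_j})$ of length
less than or equal to $M$, we first define

$$S'\bigl((x_{k_1}, \dots, x_{k_j})\bigr) = \sum_{i=1}^{j-1} \mathbb{1}_{\{x_{k_i} \ne x_{k_{i+1}}\}}$$

and then

$$S\bigl((x_{k_1}, \dots, x_{k_j})\bigr) = \begin{cases} S'\bigl((x_{k_1}, \dots, x_{k_j})\bigr) & \text{if } S'\bigl((x_{k_1}, \dots, x_{k_j})\bigr) \le m, \\ \star & \text{otherwise.} \end{cases}$$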

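Finally, in Example 4 the hidden state monitors the budget
spent so far, that is, $\mathcal{S} = \{0, \dots, B\}$,

$$S'\bigl((x_{k_1}, \dots, x_{k_j})\bigr) = \sum_{i=1}^{j} x_{k_i}$$

and

$$S\bigl((x_{k_1}, \dots, x_{k_j})\bigr) = \begin{cases} S'\bigl((x_{k_1}, \dots, x_{k_j})\bigr) & \text{if } S'\bigl((x_{k_1}, \dots, x_{k_j})\bigr) \le B, \\ \star & \text{otherwise.} \end{cases}$$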



In view of these examples, the following assumption on
$S$ is natural.

Assumption 1 The state function is Markovian in the following
sense. For all $j \ge 2$ and all vectors $(x_{k_1}, \dots, x_{k_j})$,
the state $S\bigl((x_{k_1}, \dots, x_{k_j})\bigr)$ only depends on the value of
$x_{k_j}$ and on the state $S\bigl((x_{k_1}, \dots, x_{k_{j-1}})\bigr)$.

We further assume that there exists a transition function
$T$ that, to each pair $(x, s)$ (corresponding to some task $j$)
formed by an action $x \in \mathcal{X}$ and a hidden state $s \in \mathcal{S}^\star$,
associates pairs $(x', s') \in \mathcal{X} \times \mathcal{S}^\star$ (to be used in task $j + 1$).
Put differently, $T((x, s))$ is a subset of $\mathcal{X} \times \mathcal{S}^\star$ that indicates
all legal transitions. We impose that when the prefix of a
sequence is already in the dead-end state $s = \star$, the whole
sequence stays in $\star$, that is, for all $x \in \mathcal{X}$,

$$T((x, \star)) = \mathcal{X} \times \{\star\}.$$

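Once again, to make things more concrete, we describe $T$ for
the four examples introduced in Section 3.

Example 1 relies on $\mathcal{S} = \{\text{ok}\}$ and the transitions

$$T((x, \text{ok})) = \bigl( \mathcal{X} \cap [x - \gamma, x + \gamma] \bigr) \times \{\text{ok}\}$$

for all $x \in \mathcal{X}$. Example 2 can be modeled with $\mathcal{S} = \{\text{ok}\}$
and the transitions

$$T((x, \text{ok})) = \bigl( \mathcal{X} \cap [x, x_N] \bigr) \times \{\text{ok}\}$$

for all $x \in \mathcal{X}$.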

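For Example 3, the transition function is given by

$$T((x, s)) = \{(x, s)\} \cup \bigl( (\mathcal{X} \setminus \{x\}) \times \{s + 1\} \bigr)$$

for all $s = 0, \dots, m - 1$ and

$$T((x, m)) = \{(x, m)\} \cup \bigl( (\mathcal{X} \setminus \{x\}) \times \{\star\} \bigr)$$

for $s = m$.

Finally, the one of Example 4 is given by

$$T((x, s)) = \begin{cases} \mathcal{X} \times \{s + x\} & \text{if } s + x \le B, \\ \mathcal{X} \times \{\star\} & \text{if } s + x > B. \end{cases}$$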
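The role of the hidden states and of the transition function can be illustrated by a short Python sketch that rebuilds the allowed set by following legal transitions task by task and compares it with a brute-force enumeration. Several conventions here are assumptions made for the sketch only: transitions to the dead-end state are simply omitted, and for the budget example the state is taken to be the budget spent including the current action, so that the successor state is `s + x2`. All function names are hypothetical.

```python
from itertools import product


def t_coherence(actions, gamma):
    # Successors (x2, "ok") with |x2 - x| <= gamma.
    return lambda x, s: {(x2, "ok") for x2 in actions if abs(x2 - x) <= gamma}


def t_escalation(actions):
    # Successors (x2, "ok") with x2 >= x.
    return lambda x, s: {(x2, "ok") for x2 in actions if x2 >= x}


def t_constancy(actions, m):
    # s counts the shifts so far; a shift from s = m would hit the dead end.
    return lambda x, s: {(x, s)} | {
        (x2, s + 1) for x2 in actions if x2 != x and s < m
    }


def t_budget(actions, B):
    # Sketch convention: s is the budget spent including the current action.
    return lambda x, s: {(x2, s + x2) for x2 in actions if s + x2 <= B}


def legal_tuples(actions, M, first_state, T):
    """All M-tuples obtained by chaining legal transitions from task 1."""
    paths = [((x,), first_state(x)) for x in actions
             if first_state(x) is not None]
    for _ in range(M - 1):
        paths = [(p + (x2,), s2) for p, s in paths for x2, s2 in T(p[-1], s)]
    return sorted(p for p, _ in paths)


def brute_force(actions, M, constraint):
    return sorted(x for x in product(actions, repeat=M) if constraint(x))


actions, M = [1, 2, 3], 3
via_states = legal_tuples(actions, M, lambda x: 0, t_constancy(actions, 1))
direct = brute_force(
    actions, M, lambda x: sum(a != b for a, b in zip(x, x[1:])) <= 1
)
print(via_states == direct)
```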



5.2 Reduction to an online shortest path problem 

We are now ready to describe the graph by which a constrained
multi-task problem can be reduced to an online shortest
path problem. Assume that $\mathcal{A}$ is such that there is a
corresponding state space $\mathcal{S}$, a state function $S$ satisfying
Assumption 1, and a transition function $T$. We define the
cumulative losses $L_n^{(j)}$ suffered in each task $j = 1, \dots, M$
between rounds $t = 1$ and $n$ as follows. For all $x \in \mathcal{X}$,

$$L_n^{(j)}(x) = \sum_{t=1}^{n} \ell^{(j)}(x, y_{j,t}).$$

Of course, with the notation above, for all $n \ge 1$ and all
$\mathbf{x} = (x_{k_1}, \dots, x_{k_M})$,

$$L_n(\mathbf{x}) = \sum_{j=1}^{M} L_n^{(j)}(x_{k_j}).$$

In the sequel, we extend the notation by convention to $n = 0$,
by $L_0 \equiv 0$ and $L_0^{(j)} \equiv 0$ for all $j$.

Then, for each round $t = 1, \dots, n$, we define a directed
acyclic graph with at most $MN|\mathcal{S}|$ vertices. Each vertex
corresponds to a task-action-state triple $(j, x_k, s)$, where
$j = 1, \dots, M$, $k = 1, \dots, N$, and $s \in \mathcal{S}$. Two vertices
$v = (j, x_k, s)$ and $v' = (j', x_{k'}, s')$ are connected with a directed
edge if and only if $j' = j + 1$ and $(x_{k'}, s') \in T((x_k, s))$,
that is, $(x_k, s) \to (x_{k'}, s')$ is a legal transition between tasks
$j$ and $j + 1$. The loss associated to such an edge equals
$L_{t-1}^{(j')}(x_{k'})$, the cumulative loss of action $x_{k'}$ in task $j'$ in
the previous time rounds. We also add two vertices, the
"source" node $u_0$ and the "sink" $u_1$, as follows. There is
a directed edge between $u_0$ and every vertex of the form
$(1, x_k, s)$ with $k = 1, \dots, N$ and $s \ne \star$. Its associated loss
equals $L_{t-1}^{(1)}(x_k)$. Finally, every vertex of the form $(M, x_k, s)$
with $k = 1, \dots, N$ and $s \ne \star$ is connected to the sink $u_1$
with edge loss 0.

In the graph defined above, choosing a path between the
source and the sink is equivalent to choosing a legal $M$-tuple
of actions in the multi-task problem. (Note that there is no
path between $u_0$ and $u_1$ containing a vertex with $s = \star$.)
The sum of the losses over the edges of a path is just the
cumulative loss of the corresponding $M$-tuple of actions.
Generating a legal random $M$-tuple according to the
exponentially weighted average distribution is thus equivalent to
generating a random path in this graph according to the
exponentially weighted average distribution. This can be done
with a computational complexity of the order of the number
of edges defined above, see, e.g., [CBL06, Section 5.4]. In
our case, since edges only connect two consecutive tasks, the
number of edges is at most $1 + MN^2|\mathcal{S}|^2$. In Section 5.3.1
we discuss the number of edges and the related complexity
on the examples of Section 3.

Since edges only exist between consecutive tasks, the
above implementation by reduction to an online shortest path
problem takes a simple form, which we detail below for
concreteness. It will be useful to have it for Section 8.



5.3 Brief recall of the way the efficient implementation goes

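In order to generate a random $M$-tuple of actions according
to the distribution $p_t$, we first rewrite the probability
distribution $p_t$ in terms of the state function $S$ and the cumulative
losses $L_{t-1}^{(j)}$ suffered in each task $j$. To do so, we denote by
$\delta_{\mathbf{x}}$ the Dirac mass on $\mathbf{x} = (x_{k_1}, \dots, x_{k_M})$, that is, the
probability distribution over $\mathcal{X}^M$ that puts all probability mass on
$\mathbf{x}$. The definition (1) then rewrites as

$$p_t = \frac{\displaystyle\sum_{\mathbf{x} \in \mathcal{X}^M} \mathbb{1}_{\{S(\mathbf{x}) \ne \star\}} \exp\Bigl(-\eta \sum_{j=1}^{M} L_{t-1}^{(j)}(x_{k_j})\Bigr)\, \delta_{\mathbf{x}}}{\displaystyle\sum_{\mathbf{x} \in \mathcal{X}^M} \mathbb{1}_{\{S(\mathbf{x}) \ne \star\}} \exp\Bigl(-\eta \sum_{j=1}^{M} L_{t-1}^{(j)}(x_{k_j})\Bigr)}. \tag{2}$$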



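Before proceeding with the random generation of vectors
$\mathbf{X}_t$ according to $p_t$, we introduce an auxiliary sequence of
weights and explain how to maintain it. For all rounds $t \ge 0$,
tasks $j \in \{1, \dots, M\}$, actions $x \in \mathcal{X}$, and states $s \in \mathcal{S}$, we
define

$$w_{t,j,x,s} = \sum_{x_{k_1}, \dots, x_{k_{j-1}}} \mathbb{1}_{\{S((x_{k_1}, \dots, x_{k_{j-1}}, x)) = s\}} \exp\Bigl(-\eta \Bigl( L_t^{(j)}(x) + \sum_{i=1}^{j-1} L_t^{(i)}(x_{k_i}) \Bigr)\Bigr).$$

Note that we do not consider the state $\star$ here.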

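Now, for all rounds $t \ge 0$, actions $x \in \mathcal{X}$, and states
$s \in \mathcal{S}$, one simply has

$$w_{t,1,x,s} = \exp\bigl(-\eta L_t^{(1)}(x)\bigr)\, \mathbb{1}_{\{S(x) = s\}}.$$

Then, an induction (on $j$) using Assumption 1 shows that for
all $1 \le j \le M - 1$, actions $x' \in \mathcal{X}$, and states $s' \in \mathcal{S}$,

$$w_{t,j+1,x',s'} = \sum_{x \in \mathcal{X},\, s \in \mathcal{S}} w_{t,j,x,s}\, \mathbb{1}_{\{(x', s') \in T((x, s))\}} \exp\bigl(-\eta L_t^{(j+1)}(x')\bigr). \tag{3}$$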



We now show how to use these weights to sample from
the desired distribution $p_t$, for $t \ge 1$. We proceed in a
backwards manner, drawing first $X_{M,t}$, then, conditionally on the
value of $X_{M,t}$, generating $X_{M-1,t}$, and so on, till $X_{1,t}$.

To draw $X_{M,t}$, we note that equation (2) shows that the
$M$-th marginal induced by $p_t$ is the distribution over $\mathcal{X}$ that
puts a probability mass proportional to

$$\sum_{s \in \mathcal{S}} w_{t-1,M,x,s}$$

on each action $x \in \mathcal{X}$. It is therefore easy to generate a
random element $X_{M,t}$ with the appropriate distribution. We
actually need to draw a pair $(X_{M,t}, S_{M,t}) \in \mathcal{X} \times \mathcal{S}$
distributed according to the distribution on $\mathcal{X} \times \mathcal{S}$ proportional
to the $w_{t-1,M,x,s}$.

We then aim at drawing the actions (and hidden states)
corresponding to the previous tasks according to the
(conditional) distribution $p_t(\,\cdot \mid X_{M,t}, S_{M,t})$ on $(\mathcal{X} \times \mathcal{S})^{M-1}$.
Again by using the Markovian assumption on $S$, it turns out
that the $(M-1)$-th marginal of this distribution on $\mathcal{X} \times \mathcal{S}$
is proportional, for all pairs $(x, s) \in \mathcal{X} \times \mathcal{S}$, to

$$w_{t-1,M-1,x,s}\, \mathbb{1}_{\{(X_{M,t}, S_{M,t}) \in T((x, s))\}}.$$

This procedure, based on conditioning by the future, can be
repeated to draw conditionally all the actions $X_{1,t}, X_{2,t}, \dots,
X_{M,t}$ and hidden states $S_{1,t}, S_{2,t}, \dots, S_{M,t}$. In particular,
we use, to draw $X_{j,t}$ and $S_{j,t}$, the distribution on $\mathcal{X} \times \mathcal{S}$
proportional to

$$w_{t-1,j,x,s}\, \mathbb{1}_{\{(X_{j+1,t}, S_{j+1,t}) \in T((x, s))\}}. \tag{4}$$

The realization $\mathbf{X}_t = (X_{1,t}, X_{2,t}, \dots, X_{M,t})$ obtained this
way is indeed distributed according to $p_t$.
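The forward computation of the weights and the backward sampling can be sketched as follows in Python. This is a minimal illustration under stated assumptions rather than a reference implementation: tasks are indexed from 0, the transition function returns only legal successors (the dead-end state is never stored), `first_state` gives the state of a sequence of length one, and the instance at the bottom (constancy constraint with made-up cumulative losses) is hypothetical.

```python
import math
import random
from itertools import product


def forward_weights(actions, M, first_state, T, cum_loss, eta):
    """w[j][(x, s)]: total exponential weight of the legal prefixes ending
    with action x in task j (0-indexed) and hidden state s."""
    w = [dict() for _ in range(M)]
    for x in actions:
        s = first_state(x)
        if s is not None:                     # None plays the role of the dead end
            w[0][(x, s)] = math.exp(-eta * cum_loss[0][x])
    for j in range(M - 1):
        for (x, s), value in w[j].items():
            for x2, s2 in T(x, s):            # legal transitions only
                gain = value * math.exp(-eta * cum_loss[j + 1][x2])
                w[j + 1][(x2, s2)] = w[j + 1].get((x2, s2), 0.0) + gain
    return w


def sample_tuple(w, T, rng):
    """Backward sampling: last task first, then each earlier task
    conditionally on the pair already drawn for the next task."""
    M = len(w)
    pairs = list(w[M - 1])
    current = rng.choices(pairs, weights=[w[M - 1][p] for p in pairs])[0]
    chosen = [current]
    for j in range(M - 2, -1, -1):
        preds = [p for p in w[j] if current in T(*p)]
        current = rng.choices(preds, weights=[w[j][p] for p in preds])[0]
        chosen.append(current)
    return tuple(x for x, _ in reversed(chosen))


# Hypothetical instance: constancy constraint with at most m shifts.
actions, M, m, eta = [1, 2, 3], 3, 1, 0.7
T = lambda x, s: {(x, s)} | {(x2, s + 1) for x2 in actions if x2 != x and s < m}
cum_loss = [{1: 0.3, 2: 1.0, 3: 0.1},
            {1: 0.5, 2: 0.2, 3: 0.9},
            {1: 0.0, 2: 0.4, 3: 0.6}]
w = forward_weights(actions, M, lambda x: 0, T, cum_loss, eta)
drawn = sample_tuple(w, T, random.Random(0))
print(drawn)
```

The sum of the weights of the last task equals the normalizing constant of the exponentially weighted distribution over the allowed vectors, which gives a convenient check against a brute-force computation on small instances.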

5.3.1 Complexity of this procedure for the considered 
examples 

The space complexity is of the order of at most $O(MN|\mathcal{S}|)$,
since weights have to be stored for all task-action-state triples.
The computational complexity, at a given task, for performing
the updates (3) for all $x'$ and $s'$ is bounded by the number
of pairs $(x', s')$ times the maximal number of pairs $(x, s)$ that
lead to $(x', s')$. We denote by $T_{\max}$ this maximal number of
transitions. Then, the complexity of performing (3) for all
tasks is bounded by $O(MN|\mathcal{S}|T_{\max})$. The complexity of
the random generations (4) is negligible in comparison, since
it is of the order of $O(MN|\mathcal{S}|)$.

We now compute $T_{\max}$ for the four examples described
in Section 3 and summarize the complexity results (both for
the efficient and the naive implementations) in the table below.
In Example 1, in addition to the parameter $\rho$ introduced
in Section 4, we consider a common upper bound on the
number of $\gamma$-close actions to any action in $\mathcal{X}$,

$$\overline{\rho} = \max_{x \in \mathcal{X}} \bigl| \{ x' \in \mathcal{X} : |x - x'| \le \gamma \} \bigr|.$$

Then, $T_{\max} = \overline{\rho}$. In Example 2, the value $T_{\max} = N$ is
satisfactory. In Example 3, only $T_{\max} = N$ pairs $(x, s)$, of
the form $x = x'$ and $s = s'$ or $x \ne x'$ and $s' = s + 1$, can
lead to $(x', s')$. A similar argument shows that in the case
of Example 4 only $T_{\max} = N$ such transitions are possible
as well.
Example    Naive implementation                    Efficient implementation
1          $\ge N \rho^{M-1}$                      $MN\overline{\rho}$
2          $\ge (M+1)^{N-1} / (N-1)!$              $MN^2$
3          of the order of $(MN)^m / m!$           $MN^2 m$
4          $\ge \rho^M$                            $MN^2 B$
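The quantity $T_{\max}$ can also be obtained mechanically from a transition function by counting, for each target pair, the number of pairs that lead to it. The following Python sketch does this on small instances; the function names are hypothetical, and the transition functions omit moves to the dead-end state, which do not correspond to edges of the graph.

```python
def max_in_degree(actions, states, T):
    """Largest number of pairs (x, s) that have a given pair (x', s')
    among their legal successors."""
    counts = {}
    for x in actions:
        for s in states:
            for target in T(x, s):
                counts[target] = counts.get(target, 0) + 1
    return max(counts.values())


actions = [1, 2, 3, 4]                      # N = 4 ranked actions
gamma, m = 1, 2

t_coherence = lambda x, s: {(x2, "ok") for x2 in actions if abs(x2 - x) <= gamma}
t_escalation = lambda x, s: {(x2, "ok") for x2 in actions if x2 >= x}
t_constancy = lambda x, s: {(x, s)} | {
    (x2, s + 1) for x2 in actions if x2 != x and s < m
}

print(max_in_degree(actions, ["ok"], t_coherence))      # number of gamma-close actions
print(max_in_degree(actions, ["ok"], t_escalation))     # N
print(max_in_degree(actions, range(m + 1), t_constancy))  # N
```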





6 Tracking 

In the problem of tracking the best expert of [HW98], [Vov99],
the goal of the forecaster is, instead of competing with the 
best fixed action, to compete with the best sequence of ac- 
tions that can switch actions a limited number of times. We 
may formulate the tracking problem in the framework of 
multi-task learning with hard constraints. In this case, just 
like before, at each time t, the decision maker chooses an 
$M$-tuple of actions from the set $\mathcal{A}$ of legal vectors. However,
now regret is measured by comparing the cumulative
loss of the forecaster $\sum_{t=1}^{n} \ell(\mathbf{X}_t, \mathbf{y}_t)$ with

$$\min_{(\mathbf{x}_1, \dots, \mathbf{x}_n) \in \Sigma_K(\mathcal{A})} \sum_{t=1}^{n} \ell(\mathbf{x}_t, \mathbf{y}_t),$$

where $\Sigma_K(\mathcal{A})$ is the set of all sequences of vectors of $\mathcal{A}$ that
may switch values at most $K$ times (i.e., the time interval
$1, \dots, n$ can be divided into at most $K + 1$ intervals such that
over each interval the same $M$-tuple of actions is used). In this case
it is well known that the exponentially weighted average forecaster over the
class $\Sigma_K(\mathcal{A})$ of meta-experts (see [CBL06, Sections 5.5 and
5.6] for a statement of the results and precise bibliographic
references) yields a regret bound
which holds with probability $1 - \delta$. Moreover, the complexity
of the generation of the $M$-tuples of actions achieving
this regret bound is bounded, at round $t$, by
$O\bigl(t^2 + MN^2|\mathcal{S}|^2 K t\bigr)$.

7 Multi-task learning in bandit problems 

In this section we briefly discuss a more difficult version of
the problem when the decision maker only observes the total
loss $\ell(\mathbf{X}_t, \mathbf{y}_t)$ suffered through the $M$ games but the
sequence $\mathbf{y}_t$ of outcomes remains hidden. This may be considered
as a "bandit" variant of the basic problem.

Then our problem becomes an instance of an online linear
optimization problem studied by [AK04], [MB04], [GLLO07],
[DHK08], [AHR08], [BDH+08], [CBL09]. For example, since the
dimension of the underlying space is given by the number of
edges, which is always less than $1 + MN^2|\mathcal{S}|^2$, the results
of [DHK08] imply that a variant of the exponentially
weighted average predictor achieves an expected regret of
the order

$$\mathbb{E}[R_n] = O\Bigl( M \bigl( M^{3/2} N^3 |\mathcal{S}|^3 + N |\mathcal{S}| \sqrt{M \ln |\mathcal{A}|} \bigr) \sqrt{n} \Bigr).$$



[BDH+08] proved that an appropriate modification of the
forecaster satisfies this regret bound with high probability.
As the predictor of [DHK08] requires exponentially weighted
averages based on appropriate estimates of the losses, it can
be implemented efficiently with the methods described in
Section 5. More precisely, it first computes, at each round
$t$, estimates of all losses $\ell^{(j)}(x, y_{j,t})$, when $x \in \mathcal{X}$ and
$j = 1, \dots, M$, and then can use the methods described in
Section 5. The computationally most complex point is to compute
these estimates, which essentially relies on computing
and inverting an incidence matrix of size bounded by the
number of edges. This can be done in time $O\bigl(M^3 N^6 |\mathcal{S}|^6\bigr)$.
Details are omitted.

8 Other measures of loss 

In this section we study two variations of the multi-task problem
in which the total loss incurred by the decision maker within a
round on the $M$ tasks is computed in a way different from
summing the losses over the tasks. [DLS07]
measure losses by different norms of the loss vector across
tasks but they do not consider the hard constraints introduced
here.

8.1 Choosing a subset of the tasks 

In our first example, at every round of the game, the forecaster
chooses $m$ out of the $M$ tasks and only the losses
over the chosen tasks count in the total loss. For simplicity
we only consider the full-information case here, when the
decision maker has access to all losses (not only those that
correspond to the chosen tasks).

Formally, we add an extra action $\neg$ which means that
the decision maker does not play in this task. Of course,
$\ell^{(j)}(\neg, y) = 0$ for all $j$ and $y \in \mathcal{Y}_j$. We model this by

$$\mathcal{A} = \Bigl\{ (x_1, \dots, x_M) \in \bigl(\mathcal{X} \cup \{\neg\}\bigr)^M : \sum_{j=1}^{M} \mathbb{1}_{\{x_j \ne \neg\}} = m \Bigr\}.$$

Since an element of $\mathcal{A}$ is characterized by the $m$ tasks
(out of $M$) in which it takes one among the $N$ actions of $\mathcal{X}$,
we have

$$|\mathcal{A}| = \binom{M}{m} N^m.$$

Here again, the bound of Proposition 1 applies and an efficient
implementation is possible as in Section 5, at a cost of
$O(MN^2 m^2)$.
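The counting argument for this constraint set is easy to check by enumeration on a small instance. In the Python sketch below the extra "do not play" action is represented by `None`, which is a choice made for the illustration only; the function name and the numerical values are hypothetical.

```python
from itertools import product
from math import comb

SKIP = None   # stands for the extra action "do not play in this task"


def subset_vectors(actions, M, m):
    """All vectors over actions + [SKIP] that play in exactly m of the M tasks."""
    extended = list(actions) + [SKIP]
    return [
        x for x in product(extended, repeat=M)
        if sum(xj is not SKIP for xj in x) == m
    ]


actions, M, m = [1, 2], 4, 2
vectors = subset_vectors(actions, M, m)
# Choose the m played tasks, then one of the N actions in each of them.
print(len(vectors), comb(M, m) * len(actions) ** m)
```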

Of course, additional hard constraints could be added in 
this example. 

8.2 Choosing a different global loss 

This paragraph is inspired by [DLS07], where a notion of
"global loss function" is introduced. The loss
$\ell(\mathbf{X}_t, \mathbf{y}_t)$ measured in a round is now a given function $\psi$ of the losses
$\ell^{(j)}(X_{j,t}, y_{j,t})$ incurred in each task $j$, which may be
different from their sum,

$$\ell(\mathbf{X}_t, \mathbf{y}_t) = \psi\bigl( \ell^{(1)}(X_{1,t}, y_{1,t}), \dots, \ell^{(M)}(X_{M,t}, y_{M,t}) \bigr).$$

Examples include for instance the max-loss or the min-loss,

$$\psi(u_1, \dots, u_M) = \max\{u_1, \dots, u_M\} \quad \text{or} \quad \psi(u_1, \dots, u_M) = \min\{u_1, \dots, u_M\},$$

whenever one thinks in terms of the best or worst performance.

We make a Markovian assumption on the losses. More precisely, we assume that they can be computed recursively as follows. There exists a function $\varphi$ of two real variables such that, defining the sequence $(v_2, \ldots, v_M)$ as 

$$v_2 = \varphi(u_1, u_2) \quad \text{and} \quad v_i = \varphi(v_{i-1}, u_i) \ \text{ for } i \geq 3,$$

one has 

$$v_M = \psi(u_1, \ldots, u_M).$$
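
As a small illustration of this recursive assumption, the max-loss, the min-loss and the plain sum can all be obtained by folding a two-argument function over the per-task losses. The helper name `global_loss` is hypothetical and the loss values are arbitrary.

```python
from functools import reduce
from operator import add


def global_loss(phi, losses):
    """Fold phi over the per-task losses u_1, ..., u_M.

    reduce computes phi(phi(phi(u_1, u_2), u_3), ...), which is the recursion
    v_2 = phi(u_1, u_2), v_i = phi(v_{i-1}, u_i) and returns v_M.
    """
    return reduce(phi, losses)


u = [0.3, 0.9, 0.1, 0.5]
print(global_loss(max, u))  # max-loss
print(global_loss(min, u))  # min-loss
print(global_loss(add, u))  # usual additive loss
```

A global loss such as the median of the per-task losses does not fit this pattern with a scalar running value, which is why the assumption is a genuine restriction.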

This means that if the values $v_i$ are added as a hidden state space $\mathcal{V}$, and if the latter is not too big, computation of the distributions $p_t$ defined, for all rounds $t \geq 1$ and all simultaneous actions $\mathbf{x} \in \mathcal{A}$, by 

$$p_t(\mathbf{x}) = \frac{\exp\Bigl(-\eta \sum_{s=1}^{t-1} \ell(\mathbf{x}, \mathbf{y}_s)\Bigr)}{\sum_{\mathbf{a} \in \mathcal{A}} \exp\Bigl(-\eta \sum_{s=1}^{t-1} \ell(\mathbf{a}, \mathbf{y}_s)\Bigr)},$$

can be done efficiently (a statement which will be made more precise below). In addition, it is immediate, by reduction to the single-task setting, that a regret bound as in Proposition 1 holds, where one simply has to replace $M$ with the supremum norm of $\psi$ over the losses. 

We only need to explain how and when the results of Section 5.3 extend to the case considered above. The state space $\mathcal{V}$ of possible values for the possible sequences of $v_i$ should not be too large, and the update rule of that section has to be modified, in the sense that it is unnecessary to multiply by the exponential of the losses; the global loss will be taken care of at the last step only, its value being tracked by the additional hidden space. The complexity is of the order of at most $O(M N^2 |\mathcal{S}|^2 |\mathcal{V}|^2)$. Examples of small $|\mathcal{V}|$ include the case when the global loss is a max-loss or a min-loss and the case when all outcome spaces $\mathcal{Y}_j$ and loss functions $\ell^{(j)}$ are identical. In this case, $|\mathcal{V}| = N$. 
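
The role of the hidden state can be illustrated on a deliberately simplified case: a single round, no hard constraints (so the base state space plays no role), and a table `losses[j][k]` of per-task losses. All names below are hypothetical, and this sketch is not the update of Section 5.3 itself; it only shows how the running value is carried along and how the exponential of the global loss is applied at the last step.

```python
from itertools import product
from math import exp, isclose


def weight_sum_dp(losses, phi, eta):
    """Sum over all action tuples of exp(-eta * v_M), tracking v as a hidden state."""
    # counts[v] = number of partial action tuples whose running value equals v
    counts = {}
    for u in losses[0]:
        counts[u] = counts.get(u, 0) + 1
    for task in losses[1:]:
        new_counts = {}
        for v, c in counts.items():
            for u in task:
                w = phi(v, u)
                new_counts[w] = new_counts.get(w, 0) + c
        counts = new_counts
    # The exponential of the global loss is applied only here, at the last step.
    return sum(c * exp(-eta * v) for v, c in counts.items())


def weight_sum_brute(losses, phi, eta):
    """Same quantity by explicit enumeration of all N^M action tuples."""
    total = 0.0
    for a in product(*[range(len(t)) for t in losses]):
        v = losses[0][a[0]]
        for j in range(1, len(a)):
            v = phi(v, losses[j][a[j]])
        total += exp(-eta * v)
    return total


same = [[0.1, 0.5, 0.9]] * 5  # identical loss values in every task
print(isclose(weight_sum_dp(same, max, 0.7), weight_sum_brute(same, max, 0.7)))
```

With identical loss values in every task and a max-loss, the dictionary `counts` never holds more than $N$ keys, which matches the remark that the hidden space is small in that case.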

Note that here, in addition to this change of the measure of the total loss incurred in a round, additional hard constraints can still be considered, since the base state space $\mathcal{S}$ is designed to take care of them. 

9 Multi-task learning with a continuum of 
tasks and hard constraints 

In this section we extend our model by considering infinitely 
many tasks. We focus on the case when tasks are indexed by 
the [0, 1] interval. We start by describing the setup, then pro- 
pose an ideal forecaster whose exact efficient implementa- 
tion remains a challenge. We propose discretization instead, 
which will take us back to the previously discussed case of a 
finite number of tasks. 

9.1 Continuum of tasks with a constrained number of 
shifts 

Assume that tasks are indexed by $g \in [0, 1]$. The decision maker has access to a finite set $\mathcal{X} = \{x_1, \ldots, x_N\}$ of actions. Taking simultaneous actions in all games at a given round $t$ is now modeled by choosing a measurable function 

$$I_t : g \in [0, 1] \longmapsto I_t(g) \in \mathcal{X}.$$

The opponent chooses a bounded measurable loss function $\psi_t : [0, 1] \times \mathcal{X} \to [0, 1]$. The loss incurred by the decision maker is then given by 

$$\ell_t(I_t) = \int_{[0,1]} \psi_t\bigl(g, I_t(g)\bigr)\, dg = \sum_{x \in \mathcal{X}} \int_{\{I_t = x\}} \psi_t(g, x)\, dg.$$

As before, we require that the action of the decision maker satisfies a hard constraint. One case that is easy to formulate is that $I_t$ must be right-continuous and the family of actions taken simultaneously, 

$$\bigl(I_t(g)\bigr)_{g \in [0,1]},$$

must contain at most a given number $m$ of shifts, where by definition, there is a shift at $g$ if for all $\varepsilon > 0$, the set $I_t([g - \varepsilon, g])$ contains at least two actions. We denote by $\mathcal{A}$ the set of such simultaneous actions. Actually, any element of $\mathcal{A}$ can be described by its shifts (in number at most $m$), denoted by $g_1, \ldots, g_{m'}$, with $m' \leq m$, and the actions taken in the intervals $[g_j, g_{j+1}[$ for all $j = 0, \ldots, m' - 1$, where $g_0 = 0$, and on $[g_{m'}, 1]$. 

The aim of the decision maker is to minimize the cumulative regret 

$$R_n = \sum_{t=1}^{n} \ell_t(I_t) - \inf_{I \in \mathcal{A}} \sum_{t=1}^{n} \ell_t(I),$$

where the $I_t$ are picked from $\mathcal{A}$. 

9.2 An ideal forecaster 

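We denote by $\mu$ the distribution on $\mathcal{A}$ induced by the uniform distribution on $\mathcal{X}^{m+1} \times [0, 1]^m$ via the measurable application 

$$(x_{k_1}, \ldots, x_{k_{m+1}}, g_1, \ldots, g_m) \longmapsto I, \quad \text{where } I(g) = x_{k_{j+1}} \text{ for } g \in [g_{(j)}, g_{(j+1)}[, \ j = 0, \ldots, m, \tag{5}$$

where we denoted by $(g_{(1)}, \ldots, g_{(m)})$ the order statistics of the $g_1, \ldots, g_m$, with the conventions $g_{(0)} = 0$ and $g_{(m+1)} = 1$ (the last interval being closed at 1). (It is useful to observe for later purposes that if $G_1, \ldots, G_m$ are i.i.d. uniform, then the vector 

$$V(G_1, \ldots, G_m) = \bigl(G_{(1)}, \ G_{(2)} - G_{(1)}, \ \ldots, \ G_{(m)} - G_{(m-1)}, \ 1 - G_{(m)}\bigr) \tag{6}$$

is uniformly distributed over the simplex of probability distributions with $m + 1$ elements.) 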
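
A draw from this base distribution can be sketched as follows, assuming actions are represented by their indexes $0, \ldots, N-1$ and a step function by the pair (sorted shift times, list of $m+1$ action indexes). The function names are hypothetical.

```python
import bisect
import random


def sample_from_mu(N, m, rng):
    """Draw m shift times i.i.d. uniform on [0, 1] and m + 1 action indexes i.i.d. uniform."""
    shifts = sorted(rng.random() for _ in range(m))  # order statistics
    actions = [rng.randrange(N) for _ in range(m + 1)]
    return shifts, actions


def evaluate(shifts, actions, g):
    """Action index taken at task g by the right-continuous step function."""
    # bisect_right counts the shifts <= g, so a shift located at g already uses the new action
    return actions[bisect.bisect_right(shifts, g)]


def spacings(shifts):
    """Lengths of the m + 1 pieces, i.e. the vector V(g_1, ..., g_m) of (6)."""
    pts = [0.0] + list(shifts) + [1.0]
    return [b - a for a, b in zip(pts, pts[1:])]


rng = random.Random(0)
shifts, actions = sample_from_mu(N=4, m=3, rng=rng)
print(shifts, actions, sum(spacings(shifts)))
```

Two consecutive pieces may receive the same action, in which case the sampled function has fewer than $m$ actual shifts; this is consistent with the constraint of at most $m$ shifts.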

For all $t \geq 1$, the ideal forecaster uses probability distributions $p_t$ over $\mathcal{A}$, defined below, and draws the application $I_t$ giving the simultaneous actions to be taken at round $t$ according to $p_t$. For $t = 1$, we take $p_1 = \mu$. For $t \geq 2$, we take $p_t$ as the probability distribution absolutely continuous with respect to $\mu$ and with density 

$$dp_t(I) = \frac{\exp\Bigl(-\eta \sum_{s=1}^{t-1} \ell_s(I)\Bigr)}{\int_{\mathcal{A}} \exp\Bigl(-\eta \sum_{s=1}^{t-1} \ell_s(J)\Bigr) d\mu(J)} \, d\mu(I). \tag{7}$$



The performance of this forecaster may be bounded as fol- 
lows. Note that no assumption of continuity or convexity is 
needed here. 

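Theorem 2 For all $n \geq 1$, the above instance of the exponentially weighted average forecaster, when run with 

$$\eta = \sqrt{\frac{8 (m + 1) \ln(N \sqrt{n})}{n}},$$

ensures that for all $\delta > 0$, its regret is bounded, with probability at least $1 - \delta$, as 

$$R_n \leq \sqrt{\frac{n}{2} (m + 1) \ln(N \sqrt{n})} + \sqrt{n} + \sqrt{\frac{n}{2} \ln \frac{1}{\delta}}.$$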

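Proof: By the Hoeffding-Azuma inequality, since the $\psi_t$ take bounded values in $[0, 1]$, we have that with probability at least $1 - \delta$, 

$$R_n \leq \sum_{t=1}^{n} \int_{\mathcal{A}} \ell_t(I)\, dp_t(I) - \inf_{I \in \mathcal{A}} \sum_{t=1}^{n} \ell_t(I) + \sqrt{\frac{n}{2} \ln \frac{1}{\delta}}. \tag{8}$$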

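We denote, for all $t \geq 1$, 

$$W_t = \int_{\mathcal{A}} \exp\Bigl(-\eta \sum_{s=1}^{t} \ell_s(I)\Bigr) d\mu(I)$$

(with the convention $W_0 = 1$). The bound on the difference in the right-hand side of (8) can be obtained by upper bounding and lower bounding 

$$\ln W_n = \sum_{t=1}^{n} \ln \frac{W_t}{W_{t-1}}.$$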

The upper bound is obtained, as in [CBL06, Theorem 2.2], by Hoeffding's inequality, 

$$\ln \frac{W_t}{W_{t-1}} = \ln \int_{\mathcal{A}} \exp\bigl(-\eta\, \ell_t(I)\bigr) dp_t(I) \leq -\eta \int_{\mathcal{A}} \ell_t(I)\, dp_t(I) + \frac{\eta^2}{8}.$$

A lower bound can be proved with techniques similar to the ones appearing in [BK97]; see also [CBL06, page 49]. We denote by $I^*$ the element of $\mathcal{A}$ achieving the infimum in the definition of the regret (if it does not exist, then we take an element of $\mathcal{A}$ whose cumulative loss is arbitrarily close to the infimum). As indicated in Section 9.1, $I^*$ can be described by the (ordered) shifting times $g_1^*, \ldots, g_m^*$ and the corresponding actions $x_{k_1^*}, \ldots, x_{k_{m+1}^*}$. We denote by $\lambda$ the Lebesgue measure. We consider the set of the simultaneous actions $I$ that differ from $I^*$ on a union of intervals of total length at most $\varepsilon$, for some parameter $\varepsilon > 0$, 

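$$\mathcal{A}_\varepsilon(I^*) = \bigl\{ I : \lambda\{I \neq I^*\} \leq \varepsilon \bigr\}.$$

$\mathcal{A}_\varepsilon(I^*)$ contains in particular the $I$ that can be described with the same $m + 1$ actions as $I^*$ and for which the shifting times $g_1, \ldots, g_m$ are such that 

$$\sum_{j=0}^{m} \Bigl| \bigl(g_{(j+1)} - g_{(j)}\bigr) - \bigl(g^*_{j+1} - g^*_{j}\bigr) \Bigr| \leq \varepsilon$$

(with the conventions $g_{(0)} = g^*_0 = 0$ and $g_{(m+1)} = g^*_{m+1} = 1$), i.e., the $I$ for which the corresponding probability distribution $V(g_1, \ldots, g_m)$ defined in (6) is $\varepsilon$-close in $\ell^1$-distance to $V(g_1^*, \ldots, g_m^*)$. Because $\mu$ induces by construction, via the application $V$, the uniform distribution over the simplex of probability distributions over $m + 1$ elements, we get, by taking also into account the choice of the fixed $m + 1$ actions of $I^*$, 

$$\mu\bigl(\mathcal{A}_\varepsilon(I^*)\bigr) \geq \frac{\varepsilon^m}{N^{m+1}}.$$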

Here, we used the same argument as in [BK97], based on observing the fact that the uniform measure of the $\varepsilon$-neighborhood of a point in the simplex of probability distributions over $d$ elements equals $\varepsilon^{d-1}$. In addition, because the $\psi_t$ take values in $[0, 1]$, we have, for all $I \in \mathcal{A}_\varepsilon(I^*)$ and all $s \geq 1$, 

$$\ell_s(I) \leq \ell_s(I^*) + \lambda\{I \neq I^*\} \leq \ell_s(I^*) + \varepsilon.$$

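Putting things together, we have proved 

$$\ln W_n = \ln \int_{\mathcal{A}} \exp\Bigl(-\eta \sum_{s=1}^{n} \ell_s(I)\Bigr) d\mu(I) \geq \ln \Bigl( \mu\bigl(\mathcal{A}_\varepsilon(I^*)\bigr) \exp\Bigl(-\eta \Bigl(\varepsilon n + \sum_{s=1}^{n} \ell_s(I^*)\Bigr)\Bigr) \Bigr) \geq -(m + 1) \ln N - m \ln \frac{1}{\varepsilon} - \eta \Bigl(\varepsilon n + \sum_{s=1}^{n} \ell_s(I^*)\Bigr).$$

Combining the upper and lower bounds on $\ln W_n$, taking $\varepsilon = 1/\sqrt{n}$, and substituting the proposed value for $\eta$ concludes the proof. ■ 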
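
The final tuning step can be checked numerically. The sketch below assumes the bounds used in the proof take the standard form $\ln(W_t/W_{t-1}) \leq -\eta \int \ell_t\, dp_t + \eta^2/8$ and $\mu(\mathcal{A}_\varepsilon(I^*)) \geq \varepsilon^m / N^{m+1}$, together with the choice $\varepsilon = 1/\sqrt{n}$; these forms are inferred from the surrounding argument, and the function names are hypothetical.

```python
from math import log, sqrt


def tuned_eta(n, m, N):
    """Learning rate sqrt(8 (m + 1) ln(N sqrt(n)) / n)."""
    return sqrt(8 * (m + 1) * log(N * sqrt(n)) / n)


def bound_before_tuning(eta, n, m, N, eps):
    """((m + 1) ln N + m ln(1/eps)) / eta + eps * n + n * eta / 8 (assumed form)."""
    return ((m + 1) * log(N) + m * log(1 / eps)) / eta + eps * n + n * eta / 8


def bound_of_theorem(n, m, N):
    """sqrt((n / 2) (m + 1) ln(N sqrt(n))) + sqrt(n), without the deviation term."""
    return sqrt(n / 2 * (m + 1) * log(N * sqrt(n))) + sqrt(n)


n, m, N = 1000, 3, 5
lhs = bound_before_tuning(tuned_eta(n, m, N), n, m, N, eps=1 / sqrt(n))
print(lhs, bound_of_theorem(n, m, N))
```

The first value is slightly smaller than the second because $m \ln \sqrt{n} + (m+1) \ln N$ is bounded by $(m+1) \ln(N \sqrt{n})$ before the two terms in $\eta$ are balanced.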



Efficient implementation in this context requires exact simulation of a step function $I$ according to (7), that is, from the distribution 

$$dp_t(I) \propto \exp\Bigl(-\eta \int_0^1 \varphi_{t-1}\bigl(g, I(g)\bigr)\, dg\Bigr) d\mu(I) \tag{9}$$

for the functions defined, for each $x \in \mathcal{X}$, as 

$$\varphi_{t-1}(\,\cdot\,, x) = \sum_{s=1}^{t-1} \psi_s(\,\cdot\,, x),$$

which take values in $[0, t - 1]$. One could simulate from (9) by rejection sampling proposing from $\mu$; the probability of acceptance is bounded below by something of the order of 

$$\exp\bigl(-\eta (t - 1)\bigr) = \exp\Bigl(-(t - 1) \sqrt{8 (m + 1) \ln(N \sqrt{n}) / n}\Bigr)$$

in view of the value of $\eta$. Therefore, the computational cost of such an algorithm, although only linear in $m$ and $N$, would be typically exponential in $t$, hence unappealing. 
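
For concreteness, a rejection sampler of this kind could look as follows. This is only a sketch under several assumptions: the cumulative loss is passed as a nonnegative function `cum_loss(g, k)` of the task and the action index, the integral is approximated by a midpoint rule (so the sampler is exact only up to that quadrature error), and all names are hypothetical.

```python
import math
import random


def integral_along(shifts, actions, cum_loss, grid=1000):
    """Midpoint-rule approximation of the integral over [0, 1] of cum_loss(g, I(g))."""
    total, piece = 0.0, 0
    for i in range(grid):
        g = (i + 0.5) / grid
        # advance to the piece of the right-continuous step function containing g
        while piece < len(shifts) and shifts[piece] <= g:
            piece += 1
        total += cum_loss(g, actions[piece])
    return total / grid


def rejection_sample(N, m, eta, cum_loss, rng, max_tries=100000):
    """Propose from the base distribution, accept with probability exp(-eta * integral)."""
    for tries in range(1, max_tries + 1):
        shifts = sorted(rng.random() for _ in range(m))
        actions = [rng.randrange(N) for _ in range(m + 1)]
        accept_prob = math.exp(-eta * integral_along(shifts, actions, cum_loss))
        if rng.random() < accept_prob:
            return shifts, actions, tries
    raise RuntimeError("no proposal accepted")


# Made-up cumulative loss with values in [0, 2], as if two rounds had been played.
rng = random.Random(1)
print(rejection_sample(3, 2, 0.5, lambda g, k: 2.0 * abs(g - k / 2), rng))
```

Since the acceptance probability can be as small as $\exp(-\eta (t-1))$, the expected number of proposals grows quickly with $t$, which is the drawback pointed out in the text.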

Note that the problem (at each round $t$) can be represented as a discrete-time Markov model. The Markov chain $Z$ is given by the pairs formed by the shifting times and their corresponding actions, $Z_j = (G_{(j)}, K_{j+1})$, for $j = 0, \ldots, m$, and with the convention $G_{(0)} = 0$. Let $\pi$ denote the law of this Markov chain when the times $G_1, \ldots, G_m$ are i.i.d. uniform over $[0, 1]$ and the action indexes $K_1, \ldots, K_{m+1}$ are taken i.i.d. uniform in $\{1, \ldots, N\}$. Then simulating $I$ according to (9) is equivalent to simulating $Z$ according to the distribution 

$$d\pi_{t-1}(Z) \propto \prod_{j=1}^{m+1} w_j\bigl(K_j, G_{(j-1)}, G_{(j)}\bigr)\, d\pi(Z),$$

with the additional convention $G_{(m+1)} = 1$, where, for $g \leq g'$, 

$$w_j(k, g, g') = \exp\Bigl(-\eta \int_g^{g'} \varphi_{t-1}(u, x_k)\, du\Bigr).$$

Exact simulation from $\pi_{t-1}$ is feasible when the state space of $Z$ is finite, and consists, e.g., in the same type of dynamic programming approach discussed in Section 5. However, this is not the case here, since the first component of $Z_j$ takes values in $[0, 1]$. Approximating the state space of $Z$ by a grid is a possibility for an approximate implementation, but it will be typically less efficient than the approximation we advocate in Section 9.3. 

An interesting alternative is to resort to sequential Monte Carlo methods (broadly known as particle filters, see for example [DdFG01] for a survey). This is a class of methods ideally suited for approximating Feynman-Kac formulae; a concrete example is the computation of expectations of bounded functions with respect to the laws $\pi_{t-1}$ defined above. This is achieved by generating a swarm of a given large number $K$ of weighted particles. The generation of particles is done sequentially in $j = 1, \ldots, m + 1$ by importance sampling, and it involves interaction of the particles at each step. This generates an interacting particle system whose stability properties are well studied (see, for instance, [DM04]). Resampling a single element from the particle population according to the weights gives an approximate sample from $\pi_{t-1}$, hence from (9). The total variation distance between the approximation and the target is typically $C (m + 1) / K$, for some constant $C$ depending on the range of the integrands. In the most naive implementation in this context, one might thus have that $C$ is exponentially small in $t/m$. The idea of an on-going work would be to make $C$ independent of $t$ by carefully designing the importance sampling at each step, taking into account the characteristics of the $\varphi_{t-1}$. 



Below we use a simple discretization and apply the techniques of previous sections to achieve approximate sampling from (7). 

9.3 Approximate generation by discretization 

Here we show how an approximate version of the forecaster 
described above can be implemented efficiently. 

The argument works by partitioning $[0, 1]$ into intervals $G^0 = [0, \varepsilon[$, $G^1 = [\varepsilon, 2\varepsilon[$, $\ldots$, $G^{\lceil 1/\varepsilon \rceil - 1} = [(\lceil 1/\varepsilon \rceil - 1)\varepsilon, 1]$ of length $\varepsilon$ (except maybe for the last interval of the partition), for some fixed $\varepsilon > 0$, and using the same action for all tasks in each $G^j$. Here, we aggregate all tasks within an interval $G^j$ into a super-task $j$. We have $M_\varepsilon = \lceil 1/\varepsilon \rceil$ of these super-tasks and will be able to apply the techniques of the finite case. 

More precisely, we restrict our attention to the elements of $\mathcal{A}$ whose shifting times (in number less than or equal to $m$) are starting points of some $G^j$, that is, are of the form $j\varepsilon$ for $0 \leq j \leq M_\varepsilon - 1$. We call them simultaneous actions compatible with the partitioning and denote by $B_\varepsilon$ the set formed by them. The loss of super-task $j$ at time $t$ given the simultaneous actions described by the element $I \in B_\varepsilon$ is denoted by 

$$\ell_t^{(j)}(I) = \int_{G^j} \psi_t\bigl(g, I(j\varepsilon)\bigr)\, dg.$$

Note that these losses satisfy $\ell_t^{(j)}(I) \in [0, \varepsilon]$. 
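
The aggregation into super-tasks can be sketched as follows, assuming the cells are $G^j = [j\varepsilon, \min((j+1)\varepsilon, 1)[$ and that a compatible simultaneous action is given by one action index per cell. The loss function `psi` and all names below are made up for the illustration, and integrals are approximated by a midpoint rule.

```python
import math


def num_super_tasks(eps):
    """Number of cells of length eps needed to cover [0, 1]."""
    return math.ceil(1 / eps)


def super_task_losses(psi, actions_on_grid, eps, sub=200):
    """Loss of each super-task j: integral of psi(g, action_j) over the cell G^j."""
    losses = []
    for j, k in enumerate(actions_on_grid):
        lo, hi = j * eps, min((j + 1) * eps, 1.0)
        h = (hi - lo) / sub
        # midpoint rule on the cell [lo, hi)
        losses.append(sum(psi(lo + (i + 0.5) * h, k) for i in range(sub)) * h)
    return losses


def psi(g, k):
    """Hypothetical loss with values in [0, 1]: constant 1 for action 1, else g."""
    return 1.0 if k == 1 else g


eps = 0.3
losses = super_task_losses(psi, [1, 0, 1, 0], eps)
print(num_super_tasks(eps), losses, sum(losses))
```

Because `psi` takes values in $[0, 1]$, each super-task loss lies in $[0, \varepsilon]$, and their sum is the total loss of the compatible step function over $[0, 1]$.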

By the same argument as the one used in the proof of Theorem 2, we have 

$$\inf_{I \in B_\varepsilon} \sum_{t=1}^{n} \ell_t(I) \leq \inf_{I \in \mathcal{A}} \sum_{t=1}^{n} \ell_t(I) + m n \varepsilon.$$

This approximation argument, combined with Proposition 1 and the results of Section 5, leads to the following. (We use here the fact that there are not more than $N^{m+1} \lceil 1/\varepsilon \rceil^{m}$ elements in $B_\varepsilon$.) 

Theorem 3 For all $\varepsilon > 0$, the weighted average forecaster run on the $M_\varepsilon$ super-tasks defined above, under the constraint of not more than $m$ shifts, ensures that for a proper choice of $\eta$ and with probability at least $1 - \delta$, the regret is bounded as 

$$R_n \leq \sqrt{n\, m \ln\bigl(N \lceil 1/\varepsilon \rceil\bigr)} + m n \varepsilon + \sqrt{\frac{n}{2} \ln \frac{1}{\delta}}.$$

In addition, its complexity of implementation is $O\bigl((Nm)^2 / \varepsilon\bigr)$. 



The choice of $\varepsilon$ of the order of $1/\sqrt{n}$ yields a bound comparable to the one of Theorem 2 for a moderate computational cost of $O\bigl(\sqrt{n}\,(Nm)^2\bigr)$. 
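
The approximation step behind the discretization can be illustrated by rounding each shifting time down to the start of its cell, assuming cells of the form $[j\varepsilon, (j+1)\varepsilon[$. The rounded step function (with the same actions) can differ from the original one only on a set of total length at most $m\varepsilon$, so with losses in $[0, 1]$ the loss changes by at most $m\varepsilon$ per round. The function names are hypothetical.

```python
import math


def snap_to_grid(shifts, eps):
    """Move every shifting time down to the start j * eps of the cell containing it."""
    return [math.floor(g / eps) * eps for g in shifts]


def disagreement_bound(shifts, snapped):
    """Upper bound on the total length where the two step functions may differ."""
    return sum(g - s for g, s in zip(shifts, snapped))


n = 10000
eps = 1 / math.sqrt(n)          # choice of the order of 1 / sqrt(n)
shifts = [0.1234, 0.4711, 0.8105]
snapped = snap_to_grid(shifts, eps)
m = len(shifts)
# Per-round discrepancy is at most m * eps; over n rounds at most m * n * eps = m * sqrt(n).
print(snapped, disagreement_bound(shifts, snapped), m * eps, m * n * eps)
```

With $\varepsilon = 1/\sqrt{n}$ the accumulated discretization error $m n \varepsilon$ is of order $m\sqrt{n}$, the same order in $n$ as the other terms of the bound.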

These results can easily be extended to the bandit setting, when $\psi_t$ is only observed through $I_t$ as 

$$\int_{[0,1]} \psi_t\bigl(g, I_t(g)\bigr)\, dg.$$

This is because whenever $I_t$ is compatible with the partitioning, the latter is also the sum of the losses of the actions taken in each of the super-tasks. The techniques of Section 7 can then be applied again. 